from pyspark.sql import SparkSession # Importing SparkSession
from pyspark.sql.types import IntegerType
import pandas as pd
# Required by plotting modules
%matplotlib inline
# Initialising Spark session
spark = SparkSession.builder.appName("Churn Analysis").getOrCreate()
Read the data Churn.csv into pyspark.
# The path where Churn.csv exists
filePath = "Churn.csv"
# The churn data frame is available in churnDF variable
# Spark 2.x and later ship a built-in CSV reader, so the external
# com.databricks.spark.csv package name is not needed
churnDF = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load(filePath)
# Check if data is read properly
churnDF.show(10)
# Checking the number of rows in given data
churnDF.count()
# Since data is not that large, I will cache it.
churnDF.cache()
Calculating summary statistics of variables.
# Let's do some descriptive statistics on this data
# Checking the schema of data
churnDF.printSchema()
# Using pandas dataframe function because it creates a prettier print
pd.DataFrame(churnDF.take(5), columns=churnDF.columns).transpose()
# Performing summary statistics of variables.
churnDF.describe().toPandas().transpose()
# Importing plotting libraries
import seaborn as sb
# Plotting a pairplot, which combines pairwise plots of all variables in one grid
plot = sb.pairplot(churnDF.toPandas())
Observations
- The diagonal plots show the histogram of each variable.
- The churn row of the plot above shows how each variable relates to churn.
- The plot suggests that almost all variables have some relationship with churn.
- There is a strong correlation within each of these pairs: Day Charge vs Day minutes, Evening Charge vs Evening minutes, Night Charge vs Night minutes, and International Charge vs Intl minutes.
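A near-perfect correlation between a charge column and its minutes column is what one would expect if the charge is billed at a flat per-minute rate. The sketch below illustrates that effect in plain Python. The minutes values and the rate of 0.17 are made up for illustration and are not taken from Churn.csv; `pearson` is a hypothetical helper, not part of PySpark.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, for illustration only (not from the data set)
day_mins = [100.0, 150.0, 200.0, 250.0, 300.0]
rate_per_minute = 0.17  # assumed flat rate
day_charge = [round(m * rate_per_minute, 2) for m in day_mins]

r = pearson(day_mins, day_charge)
print(r)  # very close to 1, because charge is a linear function of minutes
```

If the real columns behave this way, one column of each pair carries essentially the same information as the other.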
Let's calculate the correlation of the dependent variable (churn) with each independent variable.
# Independent variables
indepVariables = ['Account Length', 'VMail Message', 'Day Mins', 'Eve Mins', 'Night Mins',
                  'Intl Mins', 'CustServ Calls', 'Intl Plan', 'VMail Plan', 'Day Calls',
                  'Day Charge', 'Eve Calls', 'Eve Charge', 'Night Calls', 'Night Charge',
                  'Intl Calls', 'Intl Charge']
# Calculating Pearson's correlation coefficient
for var in indepVariables:
    print("Correlation between Churn vs {} = {}".format(var, churnDF.corr("Churn", var)))
Insights
- Pearson's correlation coefficient measures the strength of the linear relationship between two variables, so this step tests only for a linear relation between churn and each independent variable.
- No variable shows a strong linear relation with churn.
- This is consistent with the plot above.
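A weak Pearson coefficient rules out only a strong linear association; a variable can still be related to the target in a non-linear way. The sketch below shows this with synthetic numbers (not from Churn.csv): one series is a straight-line function of x, the other is x squared. `pearson` is a hypothetical helper written for this illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # straight-line dependence on x
quadratic = [x * x for x in xs]    # fully determined by x, but not linearly

r_linear = pearson(xs, linear)
r_quadratic = pearson(xs, quadratic)
print(r_linear)     # approximately 1
print(r_quadratic)  # 0: the symmetric curve has no linear trend
```

This is why low coefficients in the loop above do not by themselves mean a variable is useless for predicting churn.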
Null Hypothesis H0: The way a customer uses the service gives no early indication of whether the customer is likely to churn.
Alternate Hypothesis H1: There are early indications available.
Analysis:
The Null Hypothesis H0 is rejected: the plot in Step 4 shows that there is indeed a relationship between churn and many other variables. So early indications are available to predict whether a customer is likely to churn.
Therefore, the Alternate Hypothesis H1 is accepted.
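The conclusion above rests on visual inspection of the plot. One way to back it with a number is a permutation test on a single variable, comparing customers who churned with those who stayed. The sketch below is an illustration under stated assumptions, not the method used in this notebook: the call counts are made up, and `permutation_p_value` is a hypothetical helper.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical customer service call counts, for illustration only
calls_churned = [4, 5, 3, 6, 4, 5, 7, 4]
calls_stayed = [1, 2, 0, 1, 3, 2, 1, 2]

p_value = permutation_p_value(calls_churned, calls_stayed)
print(p_value)
```

A small p-value on the real columns would support rejecting H0 for that variable; a large one would not.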
The exploratory data analysis suggests that this data set carries enough signal to predict churn.
The following variables have the capability to predict churn strongly:
Because each charge column is strongly correlated with its minutes column (Day Charge vs Day minutes, Evening Charge vs Evening minutes, Night Charge vs Night minutes, and International Charge vs Intl minutes), I've included only the minutes column from each pair.
I've excluded State, Area Code and Phone Number based on domain knowledge; these are also string fields and will not add value to the predictions.
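The selection described here can be sketched as a simple filter over column names. The list of independent variables is the one used in the correlation step; the spellings 'State', 'Area Code' and 'Phone' are assumptions about how the identifier columns are named in Churn.csv, and `select_features` is a hypothetical helper.

```python
def select_features(columns, drop):
    """Return columns in their original order, minus those listed in drop."""
    dropped = set(drop)
    return [c for c in columns if c not in dropped]

indep_variables = ['Account Length', 'VMail Message', 'Day Mins', 'Eve Mins', 'Night Mins',
                   'Intl Mins', 'CustServ Calls', 'Intl Plan', 'VMail Plan', 'Day Calls',
                   'Day Charge', 'Eve Calls', 'Eve Charge', 'Night Calls', 'Night Charge',
                   'Intl Calls', 'Intl Charge']

# Charge columns duplicate the information in the matching minutes columns
redundant_charges = ['Day Charge', 'Eve Charge', 'Night Charge', 'Intl Charge']
# Assumed column names for the excluded identifier fields
identifier_columns = ['State', 'Area Code', 'Phone']

features = select_features(indep_variables, redundant_charges + identifier_columns)
print(features)
```

With the Spark data frame from earlier, the resulting list could then be passed on as `churnDF.select(*features)`.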